regression, except now, there are more equations to solve. In multiple regression, as in straight-line
regression, you can also get the information you need to estimate the standard errors (SEs) of the parameters.
Executing a Multiple Regression Analysis in Software
Before executing your multiple regression analysis, you may need to do some prep work on the
variables you intend to include in your model. In the following sections, we explain how to handle the
categorical variables you plan to include. We show you how to examine these variables by
making several charts before you run your analysis. If you need guidance on what variables to consider
for your models, read Chapter 20.
Preparing categorical variables
The predictors in a multiple regression model can be either numerical or categorical (Chapter 8
discusses the different types of data). In a categorical variable, each category is called a level. If a
variable, like Setting, can have only two levels, like Inpatient or Outpatient, then it’s called a
dichotomous or a binary categorical variable. If it can have more than two levels, it is called a
multilevel variable.
Figuring out the best way to introduce categorical predictors into a multiple regression model is
always challenging. You have to set up your data the right way, or you’ll get results that are either
wrong or difficult to interpret properly. Following are two important factors to consider.
Having enough participants in each level of each categorical variable
Before using a categorical variable in a multiple regression model, you should tabulate how many
participants (or rows) are included in each level. If you have any sparse levels (those with row frequencies in
the single digits), consider collapsing them into other levels. Usually, the more evenly
distributed the rows are across the levels, and the fewer levels there are, the more
precise and reliable the results. If a level doesn’t contain enough rows, the program may ignore that
level, halt with a warning message, produce incorrect results, or crash. Worse, if it does produce results,
they may be impossible to interpret.
Imagine that you create a one-way frequency table of a Primary Diagnosis variable from a sample of
study participant data. Your results are: Hypertension: 73, Diabetes: 35, Cancer: 1, and Other: 10. To
deal with the sparse Cancer level, you may want to create another variable in which Cancer is
collapsed together with Other (which would then have 11 rows). Another approach is to create a
binary variable with yes/no levels, such as: Hypertension: 73 and No Hypertension: 46. But a binary
variable like this one ignores the differences among the other levels. You could also make a binary Diabetes variable,
where 35 were coded as yes and the rest were no, and so on for Cancer and Other.
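The tabulating and recoding steps described here can be sketched in a few lines of Python. This is only an illustration, not a prescribed method: the data are rebuilt from the frequency table in the example, and the names SPARSE_CUTOFF, collapsed, and hypertension are hypothetical. The cutoff of 10 is our reading of "single digits."

```python
from collections import Counter

# Hypothetical data rebuilt from the one-way frequency table in the text.
diagnosis = (["Hypertension"] * 73 + ["Diabetes"] * 35
             + ["Cancer"] * 1 + ["Other"] * 10)

# Assumption: "single digits" means fewer than 10 rows in a level.
SPARSE_CUTOFF = 10

# One-way frequency table: how many rows fall in each level.
counts = Counter(diagnosis)
sparse_levels = [level for level, n in counts.items() if n < SPARSE_CUTOFF]

# Option 1: collapse every sparse level into Other.
collapsed = ["Other" if d in sparse_levels else d for d in diagnosis]
collapsed_counts = Counter(collapsed)

# Option 2: make a binary (yes/no) variable for one level.
hypertension = ["Yes" if d == "Hypertension" else "No" for d in diagnosis]
hypertension_counts = Counter(hypertension)

print("Original counts: ", dict(counts))
print("Sparse levels:   ", sparse_levels)
print("Collapsed counts:", dict(collapsed_counts))
print("Binary counts:   ", dict(hypertension_counts))
```

Whichever recoding you choose, tabulate the new variable again before putting it in your model, so you can confirm that no sparse levels remain.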
Similarly, if your model has two categorical variables with an interaction term (like Setting +
Primary Diagnosis + Setting * Primary Diagnosis), you should prepare a two-way cross-tabulation of
the two variables first (in our example, Setting by Primary Diagnosis). You are limited by the need
to have enough rows in each cell of the table to run your analysis.
See Chapter 12 for details about cross-tabulations.
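As a rough sketch of that check, the following Python code builds a Setting by Primary Diagnosis cross-tabulation and lists any cells that look too small. The cell counts are invented for illustration (only the diagnosis totals echo the earlier example, with Cancer collapsed into Other), and MIN_CELL is an assumed cutoff, not a rule.

```python
from collections import Counter

# Hypothetical cell sizes: the split by Setting is invented for illustration.
cell_sizes = {
    ("Inpatient", "Hypertension"): 30,
    ("Inpatient", "Diabetes"): 20,
    ("Inpatient", "Other"): 3,
    ("Outpatient", "Hypertension"): 43,
    ("Outpatient", "Diabetes"): 15,
    ("Outpatient", "Other"): 8,
}

# Expand the cell sizes into one (Setting, Primary Diagnosis) pair per row,
# which is the shape your real data would have.
rows = [pair for pair, n in cell_sizes.items() for _ in range(n)]

# Assumption: a cell with fewer than 10 rows counts as sparse.
MIN_CELL = 10

# Two-way cross-tabulation: count the rows in each combination of levels.
crosstab = Counter(rows)
settings = sorted({s for s, _ in rows})
diagnoses = sorted({d for _, d in rows})

# A Counter returns 0 for a combination that never occurs,
# so empty cells are flagged as sparse too.
sparse_cells = [(s, d) for s in settings for d in diagnoses
                if crosstab[(s, d)] < MIN_CELL]

for s in settings:
    print(s, {d: crosstab[(s, d)] for d in diagnoses})
print("Sparse cells:", sparse_cells)
```

In this made-up table, every level of each variable has at least 10 rows, yet both Other cells fall below the cutoff. That's why you check the cells, not only the one-way totals, before including an interaction term.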